Sometimes long data needs to be wide, and sometimes wide data needs to be long. I'll explain.
You are soon going to discover that long before you can visualize data, you need to have it in a form that the visualization library can deal with. One of the ways that isn't immediately obvious is how your data is cast. Most of the data you will encounter will be wide -- each row will represent a single entity with multiple measures for that entity. So think of states. Your dataset could have population, average life expectancy and other demographic data.
But what if your visualization library needs one row for each measure? That's where recasting your data comes in. We can use a library called reshape2
to melt
or cast
the data, depending on what we need.
So let's transform a dataset we've already used -- registered voters in Nebraska -- from wide data to long data and back again. First, we'll import the library and then open the data.
In [1]:
library(reshape2)
In [2]:
voters <- read.csv("../../Data/registeredvoters.csv")
In [3]:
head(voters)
Making data long, in most cases, is very, very easy. It's simple. We're going to create a new data frame called longvoters, and then melt
our voters data into it. Then we'll run head
and you'll see each measure gets it's own row -- so each county has 10 rows of data for it.
In [4]:
longvoters <- melt(voters)
head(longvoters)
But one problem that isn't immediately clear is where in spreadsheet world, a header like Republican10 and Republican16 makes sense, it doesn't here. We need fields to be able to have a County, a Party, a Year and then a count of those voters. So let's so some more worth with mutate
and create those columns we need. And let's start exploring programmatic text manipulations. To simplify, we're going to use a library called stringr
to get predictable patterns in our data.
So if you look at the variable
field, you see we have the party at the front and the year in the back and the year is always the last two characters of the field. The problem? We don't know WHICH characters those are. But stringr
has a powerful tool called str_sub
that can get us that, if we know one trick: That telling it to start at -2 means go to the end of the word and move backwards two characters.
In [5]:
library(stringr)
library(dplyr)
In [6]:
longvoters %>% mutate(
Year = str_sub(variable, start= -2),
)
If I really wanted to be correct, I could make those four digit years by adding 2000 to what the str_sub finds, but I also have to turn that into a number first.
In [7]:
longvoters %>% mutate(
Year = 2000 + as.integer(str_sub(variable, start= -2)),
)
Okay, so I have one side of it. Now I need the other. Luckily, we can tell str_sub to start at a character and end at one. In this case, we want to start at 1 and end at -3.
In [8]:
longvoters %>% mutate(
Year = 2000 + as.integer(str_sub(variable, start= -2)),
Party = str_sub(variable, 1, -3),
)
Two last things that are bothering me about our data? I don't like the number of voters being called value
and we don't need variable
. So we'll use dpylr to clean this up a bit.
First we're going to use dpylr's select
and a little R trick where you can use a negative sign to mean not this.
Then, we'll use rename
, which is backwards to me. If you read it, what it says is rename a column to Count and assign it the values of value. The reverse makes sense to me. Make values = Count, but dplyr
does it the other way. If you get an error, you probably did it backwards (I did).
In [9]:
longvoters %>% mutate(
Year = 2000 + as.integer(str_sub(variable, start= -2)),
Party = str_sub(variable, 1, -3),
) %>% select (-c(variable)) %>% rename(Count=value)
In [10]:
newlongvoters <- longvoters %>% mutate(
Year = 2000 + as.integer(str_sub(variable, start= -2)),
Party = str_sub(variable, 1, -3),
) %>% select (-c(variable)) %>% rename(Count=value)
Then, we can put it back together again by casting it using dcast
. With dcast
, we need to tell it which variable is our main identifier -- which is County -- and what the headers should be. We also have to tell it where the numbers should come from, since we blew it all apart. And you'll see, we've changed the data substantially, but it looks almost identical to the original dataset.
In [11]:
widevoters <- dcast(newlongvoters, County ~ Party+Year, value.var = "Count")
head(widevoters)
Melt the population estimates data from assignment 3. Create long data, where each row is a single year for a single county, with columns for the state, county, year and estimate.
In [ ]: